1 General Information


Location: NARES Location India.

Study: Drought (Stress and non-stress)

Contact Person: Rainfed Breeding Team

Experimental Design: Augmented RCBD;

  • 4 blocks

  • 1 Replication, 322 entries and 12 checks.

  • Global Check: …………………………….

  • Local Check: ……………………

Season: Wet-season (WS).

Year: 2019.


NOTE: Due to IRRI’s data policies, the actual names of lines and complete metadata information is not given in this demo report. Also, the response variables has been modified in this demo.


2 Demo Data Set


  • Here demo data set will be uploaded and viewed as table.
  • The file contains stress (drought) and non-stress trials data for traits grain yield, plant height and days to flowering.
  • Besides blocks, information on columns and rows is also given in the data set.
> # Remove previous work
>   rm(list=ls())
> # Upload the demo data set
>   demo.data<-read.csv(file="~/Documents/GitHub/Analysis-pipeline/Data/demo.data.csv", 
+                     header = TRUE)
> # Convert variables into appropriate data types
>   demo.data$Genotype<-as.factor(demo.data$Genotype) # genotypes as factor
>   demo.data$Block<-as.factor(demo.data$Block) # block as factor
>   demo.data$Row<-as.factor(demo.data$Row) # Row as factor
>   demo.data$Column<-as.factor(demo.data$Column) # Column as factor
>   
> # View as table
>   print_table <- function(table, ...){
+   datatable(table, extensions = 'Buttons',
+           options = list(scrollX = TRUE, 
+                          dom = '<<t>Bp>',
+                          buttons = c('copy', 'excel', 'pdf', 'print')), ...)
+    }
>   print_table(demo.data[, c(1, 5,6,8,9,13)], editable = 'cell', rownames = FALSE, caption = htmltools::tags$caption("Table: Showing Yield Raw Data for stress and non-stress trials",style="color:black; font-size:130%"), filter = 'top')

3 Pre-processing of Data: Checking Quality of Data



3.1 Visualize the missing data.


  • Here we will check whether the data has any missing values
> # missing data count across all columns
>   demo.data[demo.data==0]<-NA # Converting any values with Zero into NA
>   na_count <-data.frame(missing.count=sapply(demo.data, function(y) sum(length(which(is.na(y))))))
> #  colSums(is.na(demo.data)) # alternative
>   na_count$Variables<-row.names(na_count)
> # Visualize it as bar plot
>   ggbarplot(na_count, x = "Variables", y = "missing.count",
+           fill="lightblue",
+           color = "lightblue",            # Set bar border colors to white
+           x.text.angle = 45           # Rotate vertically x axis texts
+           )+
+     labs(title="Missing Data Points for all Variables",x="Variables", y = "Count")+
+     theme (plot.title = element_text(color="black", size=12,hjust=0.5, face="bold"), # add and modify the title to plot
+                  axis.title.x = element_text(color="black", size=12), # add and modify title to x axis
+                  axis.title.y = element_text(color="black", size=12))

> # Let us see which one is missing for Plant Height
>   demo.data$Height[which(is.na(demo.data$Height))]
[1] NA
> # let us see the details on this
>   demo.data[216, ]
         Environment Abiotic.stress Year Plot Genotype Block Replication
216 Non.stress.trial        Drought 2019  213      216     3           1
    Row Column Line.type Days.to.flowering Height GYKGPHA
216   3     23     entry                76     NA  6138.2

Note: Missing data with plant height variable.


3.2 Descriptive Statistics


  • Here data is summarized as mean, median, standard deviation (SD), minimum, maximum, Coefficient of variation (CV) and standard error (SE) for grain yield, days to flowering, and plant height.
> # Summary for GRAIN YIELD
> summary.gykgpha<-data.frame(demo.data %>% 
+   group_by(Environment)%>% 
+   summarize(Mean = mean(GYKGPHA, na.rm=TRUE),
+             Median= median(GYKGPHA, na.rm=TRUE),
+               SD =sd(GYKGPHA, na.rm=TRUE),
+               Min.=min(GYKGPHA, na.rm=TRUE),
+                Max.=max(GYKGPHA, na.rm=TRUE),
+                CV=sd(GYKGPHA, na.rm=TRUE)/mean(GYKGPHA, na.rm=TRUE)*100,
+                St.err= sd(GYKGPHA, na.rm=TRUE)/sqrt(length(GYKGPHA))
+                ))
>   summary.gykgpha<-data.frame(lapply(summary.gykgpha, function(y) if(is.numeric(y)) round(y, 2) else y)) 
> 
>   summary.gykgpha<-cbind(data.frame(Trait=c(rep("Yield", nrow(summary.gykgpha)))),summary.gykgpha )
> # Summary for FLOWERING DATA
> summary.flowering<-data.frame(demo.data %>% 
+   group_by(Environment)%>% 
+   summarize(Mean = mean(Days.to.flowering, na.rm=TRUE),
+             Median= median(Days.to.flowering, na.rm=TRUE),
+               SD =sd(Days.to.flowering, na.rm=TRUE),
+               Min.=min(Days.to.flowering, na.rm=TRUE),
+                Max.=max(Days.to.flowering, na.rm=TRUE),
+                CV=sd(Days.to.flowering, na.rm=TRUE)/mean(Days.to.flowering, na.rm=TRUE)*100,
+                St.err= sd(Days.to.flowering, na.rm=TRUE)/sqrt(length(Days.to.flowering))
+                ))
> summary.flowering<-data.frame(lapply(summary.flowering, function(y) if(is.numeric(y)) round(y, 2) else y)) 
> summary.flowering<-cbind(data.frame(Trait=c(rep("Flowering", nrow(summary.flowering)))),summary.flowering )
> # Summary for PLANT HEIGHT
>   summary.height<-data.frame(demo.data %>% 
+   group_by(Environment)%>% 
+   summarize(Mean = mean(Height, na.rm=TRUE),
+             Median= median(Height, na.rm=TRUE),
+               SD =sd(Height, na.rm=TRUE),
+               Min.=min(Height, na.rm=TRUE),
+                Max.=max(Height, na.rm=TRUE),
+                CV=sd(Height, na.rm=TRUE)/mean(Height, na.rm=TRUE)*100,
+                St.err= sd(Height, na.rm=TRUE)/sqrt(length(Height))
+                ))
>   summary.height<-cbind(data.frame(Trait=c(rep("Height", nrow(summary.height)))),summary.height )
> # Now combine the all data summeries and view as table
>   summary.data<-rbind(summary.gykgpha, summary.flowering, summary.height)
>   summary.data<-data.frame(lapply(summary.data, function(y) if(is.numeric(y)) round(y, 2) else y)) 
> # Add options to print and export
>   print_table(summary.data, rownames = FALSE,caption = htmltools::tags$caption("Data summary including mean, median, standard deviation (SD), coefficient of variation (CV), and standard error (St.err) for yield, days to flowering and plant Height.", style="color:black; font-size:130%"))

Note: Extremely high CV for grain yield under drought and lower values may be due to extreme drought.



3.3 Heat Maps of the Field Experimental Design


  • In this section we will visualize the field design for grain yield through heat map to get better idea about the design and spatial variations in the field.

Showing heat maps of field design under non-stress and drought environments. X axis shows the list of columns and y-axis the blocks or rows.

> 
> # For Drought data  
>   demo.data.dr<- subset(demo.data, Environment=="Stress.trial",select =c("Block", "Column", "GYKGPHA") )
>   demo.data.dr<-data.frame(demo.data.dr%>% group_by(Block)%>% arrange(Block) %>%arrange(Column))
>   demo.data.dr<-droplevels.data.frame(demo.data.dr)
>   demo.data.dr<-reshape(demo.data.dr, idvar = "Block", timevar = "Column", direction = "wide")
>   row.names(demo.data.dr)<-paste0("Block",  demo.data.dr$Block)
>   demo.data.dr<-data.matrix(demo.data.dr[,-1])
>   colnames(demo.data.dr) <- gsub(x = colnames(demo.data.dr), pattern = "GYKGPHA.", replacement = "") 
> 
> plot.gy.sa<-heatmaply(demo.data.dr, main = "Grain yield under stress (drought) trial",
+                 xlab = "Columns",
+                 ylab = "Rows",
+                 Rowv=FALSE,
+                 Colv = FALSE, cexRow = 0.8, cexCol = 0.6, na.value="white")
> plot.gy.sa
> 
> # For Non-stress Data
>   demo.data.ns<- subset(demo.data, Environment=="Non.stress.trial",select =c("Block", "Column", "GYKGPHA") )
>   demo.data.ns<-data.frame(demo.data.ns%>% group_by(Block)%>% arrange(Block) %>%arrange(Column))
>   demo.data.ns<-droplevels.data.frame(demo.data.ns)
>   demo.data.ns<-reshape(demo.data.ns, idvar = "Block", timevar = "Column", direction = "wide")
>   row.names(demo.data.ns)<-paste0("Block",  demo.data.ns$Block)
>   demo.data.ns<-data.matrix(demo.data.ns[,-1])
>   colnames(demo.data.ns) <- gsub(x = colnames(demo.data.ns), pattern = "GYKGPHA.", replacement = "") 
> 
> plot.gy.ns<-heatmaply(demo.data.ns, main = "Grain yield under non-stress trial",
+                 xlab = "Columns",
+                 ylab = "Rows",
+                 Rowv=FALSE,
+                 Colv = FALSE, cexRow = 0.8, cexCol = 0.6, na.value="white")
> plot.gy.ns

Note: Extreme blue shows low yield values and yellow are very high yield values.


3.4 Data Visualization


  • Here in this section we will visualize the data using Box plot, Histograms, and QQ plot.

3.4.1 Box plot

> # First let us visualize the data using boxplots
> myboxplot<- function(dataframe,x,y){
+    aaa <- enquo(x)
+    bbb <- enquo(y)
+    dfname <- enquo(dataframe)
+    dataframe %>%
+       filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
+       #group_by(!! aaa,!! bbb) %>%
+       #count() %>%
+      ggplot(aes_(fill=aaa, x=aaa, y=bbb))+ 
+      theme_classic()+
+           geom_boxplot()+
+      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +# fill by timepoint to give different color
+           #scale_fill_manual(values = c("", ""))+
+           #scale_color_manual(values = c("", ""))
+       theme (plot.title = element_text(color="black", size=12,hjust=0.5, face = "bold"), # add and modify the title to plot
+                  axis.title.x = element_text(color="black", size=12, face = "bold"), # add and modify title to x axis
+                  axis.title.y = element_text(color="black", size=12, face="bold")) + # add and modify title to y axis
+   #scale_y_continuous(limits=c(0,15000), breaks=seq(0,15000,1000), expand = c(0, 0))+
+           theme(axis.text= element_text(color = "black", size = 10))+ # modify the axis text
+           theme(legend.title = element_text(colour="black", size=16), legend.position = "none",
+                   legend.text = element_text(colour="black", size=14))+ # add and modify the legends
+                   guides(fill=guide_legend(title="Environments"))+
+   stat_summary(fun.y=mean, geom="line", aes(group=1))  + 
+ stat_summary(fun=mean, geom="point")
+ }
> 
> # Now draw the box plot for yield
> p1<-boxplot.yield<-myboxplot(demo.data,x=Environment,y=GYKGPHA)+
+   labs(title="",x="Environments", y = "Grain Yield")+
+   stat_compare_means(method = "anova", label.x = 1.6, label.y = 10000)
> #p1<-ggplotly(boxplot.yield)
> 
> # Now draw the box plot for flowering
> p2<-boxplot.flowering<-myboxplot(demo.data,x=Environment,y=Days.to.flowering)+
+   labs(title="",x="Environments", y = "Days to flowering")+
+   stat_compare_means(method = "anova", label.x = 1.6, label.y = 130)
> #p2<-ggplotly(boxplot.flowering)
> 
> # Now draw the box plot height
> p3<-boxplot.height<-myboxplot(demo.data,x=Environment,y=Height)+
+   labs(title="",x="Environments", y = "Plant Height (cm)")+
+   stat_compare_means(method = "anova", label.x = 1.6, label.y = 167)
> #p3<-ggplotly(boxplot.height)
> #p1+p2+p3
> par(mfrow=c(1,3))
> p1<-ggplotly(p1)
> p2<-ggplotly(p2)
> p3<-ggplotly(p3)
> subplot(p1, p2, p3, nrows=1, margin = 0.05, titleY = TRUE)

Note: Significant difference between drought and non-stress observed for all traits, p-value is provided in on top of each plot. Outliers present for all traits

Histograms and QQ plots are also available for more diagnostics , click the buttons below

3.4.2 Histogram plots

  • Histograms for all traits to check distribution of data.
> par(mfrow=c(1,2))
> # For grain yield
> envi<-unique(demo.data$Environment)
> for(i in 1:length(envi)){
+   level_envi <- demo.data[which(demo.data$Environment==envi[i]),]
+   hist(level_envi$GYKGPHA, col = "pink", xlab="Grain yield",
+        main=paste(envi[i]))
+   
+ }
Showing histograms for Grain yield, flowering and plant height. Check the problem with flowering data.

Showing histograms for Grain yield, flowering and plant height. Check the problem with flowering data.

> 
> # For Flowering date
> envi<-unique(demo.data$Environment)
> for(i in 1:length(envi)){
+   level_envi <- demo.data[which(demo.data$Environment==envi[i]),]
+   hist(level_envi$Days.to.flowering, col = "pink", xlab="Days to flowering",
+        main=paste(envi[i]))
+   
+ }
Showing histograms for Grain yield, flowering and plant height. Check the problem with flowering data.

Showing histograms for Grain yield, flowering and plant height. Check the problem with flowering data.

> 
> # For Plant height
> envi<-unique(demo.data$Environment)
> for(i in 1:length(envi)){
+   level_envi <- demo.data[which(demo.data$Environment==envi[i]),]
+   hist(level_envi$Height, col = "pink", xlab="Plant Height (cm)",
+        main=paste(envi[i]))
+   
+ }
Showing histograms for Grain yield, flowering and plant height. Check the problem with flowering data.

Showing histograms for Grain yield, flowering and plant height. Check the problem with flowering data.

Showing histograms for grain yield, flowering and plant height.

3.4.3 QQ plots

  • We will draw the QQ plots to check the normality of the data. It is just to to see if our data assumptions are plausible.
  • We expect line to be straight, if it deviates then it indicates some issues with data. More information on QQ plots can be found here on this link
> ## QQ plots to check normality assumption
> # For the grain Yield
> par(mfrow=c(1,2))
> envi<-unique(demo.data$Environment)
> for(i in 1:length(envi)){
+   level_envi <- demo.data[which(demo.data$Envi==envi[i]),]
+   qqnorm(level_envi$GYKGPHA, pch = 1, frame = TRUE,  main=paste(envi[i],".Yield"))
+   qqline(level_envi$GYKGPHA, col = "steelblue", lwd = 2)
+ }

> 
> # For the days to flowering
> par(mfrow=c(1,2))
> envi<-unique(demo.data$Environment)
> for(i in 1:length(envi)){
+   level_envi <- demo.data[which(demo.data$Envi==envi[i]),]
+   qqnorm(level_envi$Days.to.flowering, pch = 1, frame = TRUE,  main=paste(envi[i],".Flowering"))
+   qqline(level_envi$Days.to.flowering, col = "steelblue", lwd = 2)
+ }

> 
> # For the days to plant height
> par(mfrow=c(1,2))
> envi<-unique(demo.data$Environment)
> for(i in 1:length(envi)){
+   level_envi <- demo.data[which(demo.data$Envi==envi[i]),]
+   qqnorm(level_envi$Height, pch = 1, frame = TRUE,  main=paste(envi[i],".Height"))
+   qqline(level_envi$Height, col = "steelblue", lwd = 2)
+ }

Grain yield under non-stress does not look good, so do the plant height and flowering date.


3.5 Identify and Remove Outliers


Note: Outliers may drastically change the estimates, ranking (BLUPs or BLUEs) and predictions!! Further reading Resource 1; Resource 2; Resource 3

  • As the data is in Augmenetd RCBD un-replicated uni-variate approach will be used to find the potential outliers.
  • Briefly, we will flag out the lines that lie outside 1.5*IQR, where IQR, the ‘Inter Quartile Range’ is the difference between 75th and 25th quartiles, these are observation that are outside the whiskers in box plot. For more details on this please check this R blog.
  • So, in this section, a function named outlier.box which will first subset the data based on environment and replication, then we will get Quantiles, IQR and maximum and minimum values. Then we will flag out and extract the ones that fall outside the maximum and minimum value in the data.
  • More robust approach by integrating all the data across all the trials or locations to find potential outliers by the method called Bonferroni–Holm test is briefly described in this article. The R codes on this is available at Click the link.
  • Here we will be using Uni-variate approach as we are dealing with un-replicated data sets.
> # Univariate approach to falg out outliers in augmented unreplicated design
> outlier.box<- function(data, trait, name){ 
+   #test<-subset(data, Envi==envir )# subsset based on environment and replications
+   #test<-droplevels.data.frame(test) # drop factor levels
+   #var_name <- eval(substitute(var),eval(data))
+   trait_name<- eval(substitute(trait),eval(data)) # evaluate trait name
+   Q3 = quantile(trait_name, 0.75, na.rm = TRUE) # get Q3
+   Q1=quantile(trait_name, 0.25, na.rm = TRUE)
+   IQR=IQR(trait_name, na.rm = TRUE)
+   Maxi<-Q3+1.5*IQR # Maximum Value
+   Mini<-Q1-1.5*IQR # Minimum Value
+   #out_flag_max<-ifelse(trait_name >Maxi , "OUTLIER_Max", ".") # Flag lines with maximum value as OUTLIER_Max
+   #out_flag_min <-ifelse(trait_name < Mini , "OUTLIER_Min", ".") 
+   out_flag<-ifelse(trait_name >Maxi | trait_name < Mini , name, ".") # Flag the outliers
+   #out<-cbind(out_flag_max,out_flag_min)
+   out_data<-cbind(data, out_flag) # Combine the orginal data
+   #outliers<- data[which(out_data$out_flag_max!="." |out_data$out_flag_min!="." ), c(1, 2,4,7,15)] # Extract the ones with extreame values and return only selected columns
+   #outliers<- data[which(out_data$out_flag!="."),] # Extract the ones with extreame values and return only selected columns
+   return( out_data)
+ }
> 
> table(demo.data$Environment)

Non.stress.trial     Stress.trial 
             380              380 
> 
> # Now subset the data and use above function to identify the outliers
> 
> # subset
> Stress.trial<-subset(demo.data, Environment=="Stress.trial") # drought data
> Stress.trial<-droplevels.data.frame(Stress.trial) # drop factor levels
> # Now subset the non-stress data
> str(demo.data)
'data.frame':   760 obs. of  13 variables:
 $ Environment      : Factor w/ 2 levels "Non.stress.trial",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Abiotic.stress   : Factor w/ 1 level "Drought": 1 1 1 1 1 1 1 1 1 1 ...
 $ Year             : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
 $ Plot             : int  196 156 8 123 331 330 95 113 205 269 ...
 $ Genotype         : Factor w/ 344 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Block            : Factor w/ 4 levels "1","2","3","4": 3 2 1 2 4 4 1 2 3 3 ...
 $ Replication      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Row              : Factor w/ 4 levels "1","2","3","4": 3 2 1 2 4 4 1 2 3 3 ...
 $ Column           : Factor w/ 95 levels "1","2","3","4",..: 6 35 8 68 50 51 95 78 15 79 ...
 $ Line.type        : Factor w/ 2 levels "check","entry": 2 2 2 2 2 2 2 2 2 2 ...
 $ Days.to.flowering: int  94 86 112 88 101 84 85 97 88 109 ...
 $ Height           : num  95.7 90.3 120.3 112.3 114 ...
 $ GYKGPHA          : num  5078 3533 5741 4902 5851 ...
> Non.stress.trial<-subset(demo.data, Environment=="Non.stress.trial")
> Non.stress.trial<-droplevels.data.frame(Non.stress.trial) # drop factor levels
> 
> # Now identify the outliers for grain yield
> Stress.trial<-outlier.box(Stress.trial,name="Outlier.GY", trait = GYKGPHA) # returns the list that has outliers for drought environment
> Non.stress.trial<-outlier.box(Non.stress.trial,name="Outlier.GY", trait = GYKGPHA) # returns the list that has outliers for non-stress environment
> 
> # Now identify the outliers for plant height
> Stress.trial<-outlier.box(Stress.trial,name="Outlier.PH", trait = Height) # returns the list that has outliers for drought environment
> Non.stress.trial<-outlier.box(Non.stress.trial,name="Outlier.PH", trait = Height) # returns the list that has outliers for non-stress environment
> 
> # Now identify the outliers for days to flowering
> Stress.trial<-outlier.box(Stress.trial,name="Outlier.FL", trait = Days.to.flowering) # returns the list that has outliers for drought environment
> Non.stress.trial<-outlier.box(Non.stress.trial,name="Outlier.FL", trait = Days.to.flowering) # returns the list that has outliers for non-stress environment
> 
> # Now merge all the files and save them
> demo.data.out<-rbind(Stress.trial, Non.stress.trial)
> #Here we will inspect all the outliers and filter the extreame ones.
> #First let us change the names of last two columns
> colnames(demo.data.out)[c(14,15,16)] <- c("out.flag.GY", "out.flag.PH", "out.flag.FL")
> # Visualize as table
> print_table(demo.data.out[, c(1, 6,11,12,13,14,15, 16)], editable = 'cell', rownames = FALSE, caption = htmltools::tags$caption("Table: Showing the list of outliers for grain yield, plant height and flowering date.",style="color:black; font-size:130%"), filter='top')

3.6 Filter outliers

  • In this section we will convert the outliers into the missing values for all the three traits and save the file for downstream analysis.
  • Users can also convert them into mean values. See the codes given below for more details.
> # For grain yield
> demo.data.out$GYKGPHA<- ifelse(demo.data.out$out.flag.GY==".", demo.data.out$GYKGPHA, NA)
> # For plant height
> demo.data.out$Height<- ifelse(demo.data.out$out.flag.PH==".", demo.data.out$Height, NA)
> # For plant height
> demo.data.out$Days.to.flowering<- ifelse(demo.data.out$out.flag.FL==".", demo.data.out$Days.to.flowering, NA)
> # We can also conver the outliers into mean values 
> #data<-data.frame(matrix())
> #env<- unique(TEST$Envi)
>   #for(i in 1:length(env)){
>     #data1<-TEST[which(TEST$Envi==env[i]),]
>     #data1$GYKGPHA <- ifelse(data1$out.all==".", data1$GYKGPHA, mean(data1$GYKGPHA))
>     #return(data1)
>     #data2<-rbind(data1, data)
>   #}

Box Plot after Removing Outliers

> # Now draw the box plot
> p1<-boxplot.yield<-myboxplot(demo.data.out,x=Environment,y=GYKGPHA)+
+   labs(title="",x="Environments", y = "Grain Yield")+
+    stat_compare_means(method = "anova", label.x = 1.6, label.y = 10000)
> #p1<-ggplotly(boxplot.yield)
> 
> # Now draw the box plot for flowering
> p2<-boxplot.flowering<-myboxplot(demo.data.out,x=Environment,y=Days.to.flowering)+
+   labs(title="",x="Environments", y = "Days to flowering")+
+    stat_compare_means(method = "anova", label.x = 1.6, label.y = 130)
> #p2<-ggplotly(boxplot.flowering)
> 
> # Now draw the box plot height
> p3<-boxplot.height<-myboxplot(demo.data.out,x=Environment,y=Height)+
+   labs(title="",x="Environments", y = "Plant Height (cm)")+
+    stat_compare_means(method = "anova", label.x = 1.6, label.y = 167)
> #p3<-ggplotly(boxplot.height)
> 
> par(mfrow=c(1,3))
> p1<-ggplotly(p1)
> p2<-ggplotly(p2)
> p3<-ggplotly(p3)
> subplot(p1, p2, p3, nrows=1, margin = 0.05, titleY = TRUE)

Box plot showing distribution for all traits.

Note: Seems much better now. Also check the significant differences between drought and non-stress trials.

Descriptive Statistics after Removing Outliers

> summary.gykgpha<-data.frame(demo.data.out %>% 
+   group_by(Environment)%>% 
+   summarize(Mean = mean(GYKGPHA, na.rm=TRUE),
+             Median= median(GYKGPHA, na.rm=TRUE),
+               SD =sd(GYKGPHA, na.rm=TRUE),
+               Min.=min(GYKGPHA, na.rm=TRUE),
+                Max.=max(GYKGPHA, na.rm=TRUE),
+                CV=sd(GYKGPHA, na.rm=TRUE)/mean(GYKGPHA, na.rm=TRUE)*100,
+                St.err= sd(GYKGPHA, na.rm=TRUE)/sqrt(length(GYKGPHA))
+                ))
> summary.gykgpha<-data.frame(lapply(summary.gykgpha, function(y) if(is.numeric(y)) round(y, 2) else y)) 
> 
> summary.gykgpha<-cbind(data.frame(Trait=c(rep("Yield", nrow(summary.gykgpha)))),summary.gykgpha )
> # Summary for the flowering data
> summary.flowering<-data.frame(demo.data.out %>% 
+   group_by(Environment)%>% 
+   summarize(Mean = mean(Days.to.flowering, na.rm=TRUE),
+             Median= median(Days.to.flowering, na.rm=TRUE),
+               SD =sd(Days.to.flowering, na.rm=TRUE),
+               Min.=min(Days.to.flowering, na.rm=TRUE),
+                Max.=max(Days.to.flowering, na.rm=TRUE),
+                CV=sd(Days.to.flowering, na.rm=TRUE)/mean(Days.to.flowering, na.rm=TRUE)*100,
+                St.err= sd(Days.to.flowering, na.rm=TRUE)/sqrt(length(Days.to.flowering))
+                ))
> summary.flowering<-data.frame(lapply(summary.flowering, function(y) if(is.numeric(y)) round(y, 2) else y)) 
> summary.flowering<-cbind(data.frame(Trait=c(rep("Flowering", nrow(summary.flowering)))),summary.flowering )
> # Summary for plant height
> 
> summary.height<-data.frame(demo.data.out %>% 
+   group_by(Environment)%>% 
+   summarize(Mean = mean(Height, na.rm=TRUE),
+             Median= median(Height, na.rm=TRUE),
+               SD =sd(Height, na.rm=TRUE),
+               Min.=min(Height, na.rm=TRUE),
+                Max.=max(Height, na.rm=TRUE),
+                CV=sd(Height, na.rm=TRUE)/mean(Height, na.rm=TRUE)*100,
+                St.err= sd(Height, na.rm=TRUE)/sqrt(length(Height))
+                ))
> 
> summary.height<-data.frame(lapply(summary.height, function(y) if(is.numeric(y)) round(y, 2) else y)) 
> summary.height<-cbind(data.frame(Trait=c(rep("Height", nrow(summary.height)))),summary.height )
>  
> # Now combine the all data summeries and view as table
> summary.data<-rbind(summary.gykgpha, summary.flowering, summary.height)
> datatable(summary.data,options = list(pageLength = 7, dom = 'tip'), rownames = FALSE,caption = htmltools::tags$caption("Data summary after removing outliers.", style="color:black; font-size:130%"))
> 
> 
> # Save the file for analysis
>      write.csv(demo.data.out, file = "~/Documents/GitHub/Analysis-pipeline/Outputs/Tables/demo.data.filtered.csv", row.names = FALSE)        

4 Data Analysis for Grain Yield


  • In this section, we will analyze only grain yield data using a Mixed-Model Approach in Asreml-R package.
  • Analysis is divided in two parts:
  • First Separate analysis: In this stress and non-stress trials will be analyzed separately. We will be testing five mixed models correcting for experimental design factors (Blocks here) and spatial variations. Then best model will be selected (model having lowest AIC value ) and used to extract the BLUPs or Breeding Values and Heritability.
  • Second Combined analysis or MET analysis: In this analysis drought and non-stress trials will combined and analyzed together and single value BLUPs for each genotype will be extracted. Heritability will be also noted. The model used for combined analysis is again Mixed-Model accounting for spatial variations. We already know best spatial model (found in separate analysis above) in drought and non-stress trial, so this information will be used and incorporated in combined mixed-model analysis model.
  • Then we will rank the genotypes based on the BLUP values and compare it with checks.
  • We will also look at correlations between the trials.

4.1 Separate Analysis


Model 1

  • In this model we account for just experimental design factor Block and no spatial variation.

  • Note we used block as fixed effect in most cases due to less than 5 degrees of freedom. If you are interested to know whether to use block fixed or random in model I highly recommend this Blocks Fixed or Random?

  • Also Note row and block is same in all the trials. So it does not matter whether we use row or block in model.

  • Further, we use word environment or trial as synonymous, environment here is stress (drought) and non-stress trials.

  • Best linear unbiased predictors (BLUPs) extracted here is equivalent to breeding values


\[ Y_{ij}= \mu+G_{i} + B_{j} + \varepsilon_{ij}\\ Y_{ij}= \text{ is the effect of $i$th genotype in $j$th block} \\ \mu= \text {overall mean}\\ G_{i}=\text{effect of the $i$th genotype}\\ B_{k}= \text {effect of $k$th block}\\ e_{ij}=\text{error}\\ \text{here we assume residuals are independent and identically distributed }\varepsilon\sim \text{$iid$N}(0,\sigma_e^2)\\ \]

R script in Asreml

model1<-asreml(fixed=trait~Block, random=~Genotype, na.method=“include”, data=data)

  • Block as fixed effect and genotype as random effect

Model 2

  • In this model we account for just experimental design factor block, and column no spatial variation.

\[ Y_{ijk}= \mu+G_{i} + B_{j}+ C_{k} + \varepsilon_{ijk}\\ Y_{ijk}= \text{ is the effect of $i$th genotype in $j$th block and $k$th column} \\ \mu= \text {overall mean}\\ G_{i}=\text{effect of the $i$th genotype}\\ B_{j}= \text {efect of $j$th block}\\ C_{k}= \text {efect of $k$th column}\\ e_{ijk}=\text{error}\\ \text{here we assume residuals are independent and identically distributed }\varepsilon\sim \text{$iid$N}(0,\sigma_e^2)\\ \]

R script in Asreml

model2<-asreml(fixed=trait~1, random=~Column+Block+Genotype, na.method=“include”, data=data)

  • Block, column and genotype all are used as random effects.

Model 3

  • In this model we account block, and column effects, and for spatial variations (i.e, correlated residuals both across blocks or rows and columns) in row and column direction both.

\[ Y_{ijk}= G_{i} + B_{j}+ C_{k} + \varepsilon_{ijk}:ar1(B):ar1(C)\\ Y_{ijk}= \text{ is the effect of $i$th genotype and $j$th block in $k$th column} \\ \mu= \text {overall mean}\\ G_{i}=\text{effect of the $i$th genotype}\\ B_{k}= \text {efect of $k$th block}\\ C_{k}= \text {efect of $k$th column}\\ ar1(B):ar1(C)=\text{AR1_AR1 first order autoregressive variance model for both Block/Row and Column}\\ \] here, we assume residuals are correlated based on the distance between plots along both the rows and columns; that is \[\sim \sum{_B}(p{_B})\bigotimes\sum{_C}(p{_C})\] where,\[\sum{_B}(p{_B})\] is the correlation matrix for the row model \[(p{_rB})\] is the auto-correlation parameter in row direction, and \[\sum{_C}(p{_C})\] is the correlation matrix for the column model and \[(p{_C})\] the auto-correlation parameter in the column direction

R script in Asreml

model3<-asreml(fixed=trait~1, random=~Column+Block+Genotype,residual =~ar1v(Block):ar1(Column), na.method=“include”, data=data)

  • Block, column and genotype all are used as random effects.

Model 4

  • In this model we account for only spatial variations (i.e, correlated residuals both across blocks or rows and columns) in row and column direction both.

\[ Y_{ijk}= G_{i} + B_{j} + \varepsilon_{ij}:ar1(B):ar1(C)\\ Y_{ijk}= \text{ is the effect of $i$th genotype and $j$th block} \\ \mu= \text {overall mean}\\ G_{i}=\text{effect of the $i$th genotype}\\ B_{k}= \text {efect of $j$th block}\\ ar1(B):ar1(C)=\text{AR1_AR1 first order autoregressive variance model for both Block/Row and Column}\\ \] here, we assume residuals are correlated based on the distance between plots along both the rows and columns; that is \[\sim \sum{_B}(p{_B})\bigotimes\sum{_C}(p{_C})\] where,\[\sum{_B}(p{_B})\] is the correlation matrix for the row model \[(p{_rB})\] is the auto-correlation parameter in row direction, and \[\sum{_C}(p{_C})\] is the correlation matrix for the column model and \[(p{_C})\] the auto-correlation parameter in the column direction

R script in Asreml

model4<-asreml(fixed=trait~Block, random=~Genotype,residual =~ar1v(Block):ar1(Column), na.method=“include”, data=data)

  • Block is fixed and genotype as random effects.

Model 5

  • In this model we account for spatial variations (i.e, correlated residuals across columns) in column direction only.

\[ Y_{ijk}= G_{i} + B_{j}+ C_{k} + \varepsilon_{ijk}:ar1(B):ar1(C)\\ Y_{ijk}= \text{ is the effect of $i$th genotype and $j$th block in $k$th column} \\ \mu= \text {overall mean}\\ G_{i}=\text{effect of the $i$th genotype}\\ B_{k}= \text {efect of $k$th block}\\ C_{k}= \text {efect of $k$th column}\\ ar1(B):ar1(C)=\text{AR1_AR1 first order autoregressive variance model for both Block/Row and Column}\\ \] here, we assume residuals are correlated based on the distance between plots along both the rows and columns; that is \[\sim\bigotimes\sum{_C}(p{_C})\] where, \[\sum{_C}(p{_C})\] is the correlation matrix for the column model and \[(p{_C})\] the auto-correlation parameter in the column direction

R script in Asreml

model5<-asreml(fixed=trait~Block, random=~Genotype,residual =~idv(Block):ar1(Column), na.method=“include”, data=data))

  • Block as fixed effect and genotype as random effects.

4.1.1 Best Model for Grain Yield

  • Here we will first built a function to run all the five models.
  • These five models will be run separately on stress and non-stress data.
  • Best model will be selected based on AIC and residual plot information, and later used to extract BLUPs in stress and non-stress trials.
  • Please click on the code button on right-side to see the model with lowest AIC value.

Read data filtered for outliers and built function for running models


Best Model for Grain Yield Under Stress (drought)

> # Now run above function to test various models for both environments and traits
> # For grain yield under drought
> output.dr.gy<-my.asreml(demo.dr, trait = "GYKGPHA")
Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:44 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2668.402      502314.7    374 21:50:44    0.0
 2     -2668.002      476008.2    374 21:50:44    0.0
 3     -2667.653      439703.9    374 21:50:44    0.0
 4     -2667.523      404071.4    374 21:50:44    0.0
 5     -2667.521      399967.1    374 21:50:44    0.0

Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:45 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2688.939      469860.0    377 21:50:45    0.0
 2     -2687.254      454285.9    377 21:50:45    0.0
 3     -2686.123      431663.6    377 21:50:45    0.0 (1 restrained)
 4     -2685.923      402638.0    377 21:50:45    0.0 (1 restrained)
 5     -2685.920      401761.4    377 21:50:45    0.0 (1 restrained)
 6     -2685.919      401882.1    377 21:50:45    0.0 (1 restrained)

Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:46 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2686.069      473817.6    377 21:50:46    0.0
 2     -2678.059      427990.0    377 21:50:46    0.0 (1 restrained)
 3     -2673.146      400169.5    377 21:50:46    0.0
 4     -2670.451      369797.5    377 21:50:46    0.0
 5     -2669.512      357028.5    377 21:50:46    0.0
 6     -2669.377      355690.0    377 21:50:46    0.0
 7     -2669.374      354745.8    377 21:50:46    0.0

Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:46 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2664.067      498943.8    374 21:50:46    0.0
 2     -2658.124      451894.6    374 21:50:46    0.0
 3     -2653.184      399017.6    374 21:50:46    0.0
 4     -2651.287      366035.0    374 21:50:46    0.0
 5     -2650.962      357963.7    374 21:50:46    0.0
 6     -2650.956      357267.3    374 21:50:46    0.0

Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:47 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2660.754      487146.3    374 21:50:47    0.0
 2     -2656.900      451199.2    374 21:50:47    0.0
 3     -2653.431      403933.7    374 21:50:47    0.0
 4     -2651.880      366094.1    374 21:50:47    0.0
 5     -2651.550      350127.5    374 21:50:47    0.0
 6     -2651.541      347930.3    374 21:50:47    0.0
 7     -2651.541      347301.7    374 21:50:47    0.0

> # Extract the name of model that has lower AIC
> best.model.dr.gy<-colnames(output.dr.gy)[apply(output.dr.gy,1,which.min)]
> best.model.dr.gy
[1] "model5"

Click on code icon on right side to see which model is best


Best Model for Grain Yield Under Non-stress

> # For grain yield under non-stress
> output.ns.gy<-my.asreml(demo.ns, trait = "GYKGPHA")
Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:49 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2828.584   1.64524e+06    366 21:50:49    0.0
 2     -2826.920    1.4189e+06    366 21:50:49    0.0
 3     -2822.695        942757    366 21:50:49    0.0
 4     -2817.092        471797    366 21:50:49    0.0
 5     -2815.761        319798    366 21:50:49    0.0
 6     -2815.737        303774    366 21:50:49    0.0
 7     -2815.736        302510    366 21:50:49    0.0

Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:50 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2849.835   1.53003e+06    369 21:50:50    0.0
 2     -2846.902     1.342e+06    369 21:50:50    0.0
 3     -2841.693        885955    369 21:50:50    0.0
 4     -2835.046        412289    369 21:50:50    0.0
 5     -2833.196        270715    369 21:50:50    0.0
 6     -2833.134        248188    369 21:50:50    0.0
 7     -2833.131        243678    369 21:50:50    0.0

Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:51 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2841.938   1.50052e+06    369 21:50:51    0.0
 2     -2824.576   1.29677e+06    369 21:50:51    0.0 (1 restrained)
 3     -2811.044   1.02089e+06    369 21:50:51    0.0 (1 restrained)
 4     -2808.754   1.06205e+06    369 21:50:51    0.0 (2 restrained)
 5     -2808.204    1.1723e+06    369 21:50:51    0.0 (2 restrained)
 6     -2808.088   1.25569e+06    369 21:50:51    0.0 (2 restrained)
 7     -2808.071   1.28643e+06    369 21:50:51    0.0 (1 restrained)
 8     -2808.068   1.29973e+06    369 21:50:51    0.0
 9     -2808.067   1.30541e+06    369 21:50:51    0.0

Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:51 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2818.405   1.58167e+06    366 21:50:51    0.0
 2     -2803.230   1.32999e+06    366 21:50:51    0.0 (1 restrained)
 3     -2795.443   1.14418e+06    366 21:50:51    0.0
 4     -2790.706   1.05562e+06    366 21:50:51    0.0
 5     -2789.671   1.18091e+06    366 21:50:51    0.0
 6     -2789.492    1.2885e+06    366 21:50:51    0.0
 7     -2789.471   1.32661e+06    366 21:50:51    0.0
 8     -2789.467   1.33733e+06    366 21:50:51    0.0
 9     -2789.466   1.34306e+06    366 21:50:51    0.0

Model fitted using the gamma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:52 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -2815.882   1.55052e+06    366 21:50:52    0.0
 2     -2802.743   1.32043e+06    366 21:50:52    0.0
 3     -2793.020   1.06478e+06    366 21:50:52    0.0
 4     -2790.761   1.10752e+06    366 21:50:52    0.0
 5     -2790.497    1.2309e+06    366 21:50:52    0.0
 6     -2790.470   1.26827e+06    366 21:50:52    0.0
 7     -2790.465   1.28412e+06    366 21:50:52    0.0
 8     -2790.465    1.2905e+06    366 21:50:52    0.0

> # Extract the name of model that has lower AIC
> best.model.ns.gy<-colnames(output.ns.gy)[apply(output.ns.gy,1,which.min)]
> best.model.ns.gy 
[1] "model5"
> best.model.ns.gy
[1] "model5"

Click on code icon on right side to see which model is best


<>

4.1.2 Extract BLUPs and Heritabilities

  • Here in this section we will select and run the best model and extract BLUPs (know more on BLUPs or BLUEs here).

  • We will also calculate the heritabilities. Note we are dealing with trials that is un-replicated and has missing data, so we cannot use basic formula as: \(h{^2}= \frac{\sigma^{2}g}{\sigma^{2}g+\sigma^{2}e}\) to calculate heritability. Plus when we are dealing with spatial models or complex models, calculating heritability with this method is not appropriate.

  • Alternative method as described by Piepho and M€ohring (2007) is more appropriate for complex residual structures and unbalanced experimental designs. The equation is: \(H_{C}=1-\frac{\overline{V}_{BLUP}}{2\sigma^{2}g}\). Where \(\overline{V}_{BLUP}\) is mean variance difference of difference of two BLUP and \(\sigma^{2}g\) is variance of genotypes. Note this definition of heritability is related to reliability of breeding value predictions. For more details please check the Book: Genetic Data Analysis for Plant and Animal Breeding; Chapter 7 and this beautiful resource Summary of heritability equations

  • So in this section a developed function called my.blup which will be used to extract BLUPs and then heritability will be calculated by method described above.

> # Now select the best model to extract BLUPs for each trait and environment
> # First we will build again a function to extract BLUPs and heritability from best model
> my.blup<-function(model, data){
+   #p<-plot(varioGram(model))
+   # Now use predict function to return the list of three containing predicted values, and average S.E differnces
+   predicted.values<-predict(model, "Genotype", sed=T)
+   # Extract the BLUPs from above
+   blups<-predicted.values$pvals
+   # Now let us add the line designation names
+   # BLUPs with line names 
+   #blups<-merge(data[,c(7,8,13,14)],blups, by="Genotype")
+   #blups<-blups[!duplicated(blups$Genotype), ]
+   # Calculate the heritability
+   # Simply based on the varaince componnets
+   #heritability<-vpredict(model5, hA ~  V1/(V1 + V2+V3+V4+V5))
+   #H2<-heritability[1,1]*100
+   #the Reliazied heritability that is appropriate for complex residual structures and unbalanced experimental designs introduced by Cullis et al. (2006) and discussed by Piepho and M€ohring (2007):
+   # page 235
+   # First let us extract the vBLUp differnce
+   avgsd<-predicted.values$avsed[2]
+   h2<- (1-((predicted.values$avsed[2])^2/((summary(model)$varcomp[1,1])*2)))*100
+   return(list(Heritability=h2, BLUPs=blups))
+ }
> 
> # Now for grain yield under drought
> 
> best.model.dr.gy
> model5.gy.dr<-asreml(fixed=GYKGPHA~Block, random=~Genotype,
+                residual =~idv(Block):ar1(Column),  na.method="include", data=demo.dr)
> # BLUPs and heritability for grain yield under stress
> out.gy.dr<-my.blup(model5.gy.dr, demo.dr)
> out.gy.dr$Heritability
> blups.dr.gy<-out.gy.dr$BLUPs
> names(blups.dr.gy)[c(2,3)]<-c("blups.gy", "std.er.gy")
> 
> # Now for grain yield under non-stress
> best.model.ns.gy
> model5.gy.ns<-asreml(fixed=GYKGPHA~Block, random=~Genotype,
+                residual =~idv(Block):ar1(Column),  na.method="include", data=demo.ns)
> # BLUPs and heritability for grain yield under drought
> out.gy.ns<-my.blup(model5.gy.ns, demo.ns)
> out.gy.ns$Heritability
> blups.ns.gy<-out.gy.ns$BLUPs
> # rename the columns and select appropriate columns
> names(blups.ns.gy)[c(2,3)]<-c("blups.gy", "std.er.gy")
> 
> # Now let us combine all the BLUPs dataframes into one and save
> # Let us add stress information column first
> 
> blups.dr<-data.frame(cbind(data.frame(Stress=c(rep("Drought",nrow(blups.dr.gy)))), blups.dr.gy))
> # Now add line.type information
> blups.dr<-merge(demo.dr[,c(5,10)],blups.dr, by="Genotype")
> blups.dr<-blups.dr[!duplicated(blups.dr$Genotype), ]
> 
> # Now combine non-stress
> blups.ns<-data.frame(cbind(data.frame(Stress=c(rep("Non-stress",nrow(blups.ns.gy)))), blups.ns.gy))
> # Now add the designation name and line.type
> blups.ns<-merge(demo.ns[,c(5,10)],blups.ns, by="Genotype")
> blups.ns<-blups.ns[!duplicated(blups.ns$Genotype), ]
> # Now combine all
> blups.all<-rbind(blups.dr[,-7], blups.ns[,-7])
> # Round all the columns containing blups and standard errors
> blups.all<-data.frame(lapply(blups.all, function(y) if(is.numeric(y)) round(y, 2) else y)) 
> # Save the blups in the directory
> write.csv(blups.all, 
+           file="~/Documents/GitHub/Analysis-pipeline/Outputs/Tables/blups.all.seperate.csv",
+           row.names = FALSE)

Summary and Heritability for Grain Yield

> # Calcualate summary and heritability
> # Save heritability as vector
> Heritability<-c(out.gy.dr$Heritability, out.gy.ns$Heritability)
> # 
> summary.gy<-cbind(data.frame(blups.all%>% 
+         group_by(Stress)%>% 
+   summarize(Mean = mean(blups.gy, na.rm=TRUE),
+             Median= median(blups.gy, na.rm=TRUE),
+               SD =sd(blups.gy, na.rm=TRUE),
+               Min.=min(blups.gy, na.rm=TRUE),
+                Max.=max(blups.gy, na.rm=TRUE))
+                ),Heritability) 
> # Round 
> summary.gy<-data.frame(lapply(summary.gy, function(y) if(is.numeric(y)) round(y, 2) else y)) 
> 
> # Plot the data.tables
> print_table(summary.gy, rownames = FALSE,caption = htmltools::tags$caption("Data summaries of BLUPs in stress (drought) and non-stress trials for Grain Yield ", style="color:black; font-size:130%"))

Note: Heritability under drought and non-stress is very close.

BLUPs for Grain Yield

> # BLUPs table
> print_table(blups.all[, c(1,3,4,5,6)],editable = 'cell', rownames = FALSE,caption = htmltools::tags$caption("BLUPs along with standard errors for grain yield in drought and non-stress trials.", style="color:black; font-size:130%"), filter = 'top')

4.2 Combined or MET analysis

  • Here in this section Combined analysis/ multi-environment analysis for drought and non-stress trials will be performed and single BLUP values for each genotype will be predicted. We will also calculate combined heritability.

  • With the separate analysis done above we know which best spatial model works in each trial. We will directly borrow this information and incorporate into our combined model analysis, so we do not need to test various models.

*Click button below for more model description:

Combined or MET model *** \[ Y_{ijk}= \mu+G_{i} + E_{j}+G_{i}*E_{j}+ B_{k}(E_{j}) + \varepsilon_{ijk}:ar1(B):ar1(C)\\ Y_{ijk}= \text{ is the effect of $i$th genotype in $j$th environment and $k$th block} \\ \mu= \text {overall mean}\\ G_{i}=\text{effect of the $i$th genotype}\\ E_{j}=\text{effect of the $j$th environment}\\ B_{k}= \text {efect of $k$th block nested in $j$th environment}\\ e_{ijk}=\text{error}\\ ar1(B):ar1(C)=\text{AR1_AR1 first order autoregressive variance model for both Block/Row and Column}\\ \] here, we assume residuals are correlated based on the distance between plots along both the rows and columns; that is \[\sim \sum{_B}(p{_B})\bigotimes\sum{_C}(p{_C})\] where,\[\sum{_B}(p{_B})\] is the correlation matrix for the row model \[(p{_rB})\] is the auto-correlation parameter in row direction, and \[\sum{_C}(p{_C})\] is the correlation matrix for the column model and \[(p{_C})\] the auto-correlation parameter in the column direction. Further it should be noted we applied a separate spatial variation for each environment in the combined model. See the Asreml code below:

R script in Asreml for grain yield

met.gy<-asreml(GYKGPHA ~1,random= ~Genotype +Environment:Genotype+Block:Environment,residual = dsum(idv(Block):ar1(Column)+ ~idv(Block):ar1(Column)|Environment,levels = list(c(1), c(2))), na.method =“include”, data = crurrs.data.out)

  • Genotype, environment and block were all treated as random effects .

> # First we will read the filtered data set contianing both drought and non-stress data. 
>  if(exists('demo.data.out') && is.data.frame(get('demo.data.out'))){
+     demo.data.out=demo.data.out
+   }else{
+     demo.data.out<-read.csv(file="~/Documents/GitHub/Analysis-pipeline/Outputs/Tables/demo.data.filtered.csv",
+                             header = TRUE) 
+   }
> # In case checks are used as fixed effects  
> # Create two new columns if design is augmenetd. 
> # Adding a new column 'new' that will help treat genotypes as separate
> #demo.data.out$Genotype<-as.numeric(demo.data.out$Genotype)
> #demo.data.out<- within(demo.data.out,{
> #new <- ifelse(demo.data.out$Line.type=="check", 0, 1)
>   #})
>   # Adding a new column 'Genotypec' that will help us group all the new entries
>   # in a single pool, yet treat all checks as separate
>   #demo.data.out<- within(demo.data.out, {
>    # Genotypec <- ifelse(demo.data.out$new > 0, 999, demo.data.out$Genotype)
>   #})
>   
> # Arrange the the data set before running it
> demo.data.out<-data.frame(demo.data.out%>% group_by(Environment)%>%arrange(Row, Column))
> demo.data.out<-demo.data.out%>% arrange(Environment)
> columns<-c("Plot", "Genotype", "Replication", "Block", "Row", "Column", "Year")
> demo.data.out[, columns]<-lapply(columns, function(x) as.factor( demo.data.out[[x]]))
> demo.data.out$GYKGPHA<-as.numeric(demo.data.out$GYKGPHA)

Click on code on right-side to see detailed models and how heritability and BLUPs are extracted

> # Here we will perform combined analysis of data, by combining drought and non-stress
> # Spatial variation model will be used, model will be selected based on previous analysis done seperately
> # For GRAIN YIELD
>   met.gy<-asreml(GYKGPHA ~1,random= ~Genotype +Environment:Genotype+Block:Environment,
+           residual =~dsum(~idv(Block):ar1(Column)+idv(Block):ar1(Column)|Environment,levels = list(c(1), c(2))), na.method ="include", data = demo.data.out)
Multi-section model using the sigma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:55 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -6232.751           1.0    747 21:50:55    0.2
 2     -6011.516           1.0    747 21:50:55    0.1
 3     -5771.812           1.0    747 21:50:55    0.1
 4     -5613.425           1.0    747 21:50:55    0.1
 5     -5528.576           1.0    747 21:50:55    0.1
 6     -5502.850           1.0    747 21:50:55    0.1
 7     -5498.699           1.0    747 21:50:55    0.0
 8     -5498.466           1.0    747 21:50:55    0.0
 9     -5498.462           1.0    747 21:50:55    0.0
>   
>   summary(met.gy)$varcomp
                                           component    std.error  z.ratio
Block:Environment                       2.597734e+06 1.404766e+06 1.849228
Genotype                                4.405437e+04 4.160899e+04 1.058770
Environment:Genotype                    2.018192e+05 6.787976e+04 2.973186
Environment_Stress.trial!R              1.000000e+00           NA       NA
Environment_Stress.trial!Block          3.311190e+05 5.791122e+04 5.717700
Environment_Stress.trial!Column!cor     4.837215e-01 9.184277e-02 5.266844
Environment_Non.stress.trial!R          1.000000e+00           NA       NA
Environment_Non.stress.trial!Block      1.592149e+06 1.778096e+05 8.954236
Environment_Non.stress.trial!Column!cor 5.063207e-01 5.412508e-02 9.354642
                                        bound %ch
Block:Environment                           P 0.1
Genotype                                    P 0.1
Environment:Genotype                        P 0.2
Environment_Stress.trial!R                  F 0.0
Environment_Stress.trial!Block              P 0.1
Environment_Stress.trial!Column!cor         U 0.1
Environment_Non.stress.trial!R              F 0.0
Environment_Non.stress.trial!Block          P 0.1
Environment_Non.stress.trial!Column!cor     U 0.0
>   #aic<- -2*(model.met2$loglik-length(model.met2$vparameters));aic
>   predicted.gy<-predict(met.gy, "Genotype", sed=T)
Multi-section model using the sigma parameterization.
ASReml 4.1.0 Fri Jul 24 21:50:55 2020
          LogLik        Sigma2     DF     wall    cpu
 1     -5498.462           1.0    747 21:50:55    0.4
 2     -5498.462           1.0    747 21:50:55    0.1
 3     -5498.462           1.0    747 21:50:55    0.3
>   # Extract the BLUPs from above
>   blups.gy.met<-predicted.gy$pvals
>   names(blups.gy.met)[c(2,3)]<-c("blups.gy", "std.er.gy")
>   # Now calculate heritability
>   h2.gy.met<- (1-((predicted.gy$avsed[2])^2/((summary(met.gy)$varcomp[2,1])*2)))*100;h2.gy.met
    mean 
12.04419 
> 
> # Now add designation and line.type to blup file
>  # Now add the genotype name and line.type
> blups.met<-merge(demo.dr[,c(2,5,10)],blups.gy.met[,-4], by="Genotype")
> blups.met<-blups.met[!duplicated(blups.met$Genotype), ] 
> blups.met<-data.frame(lapply(blups.met, function(y) if(is.numeric(y)) round(y, 2) else y))

BLUPs for grain yield from combined analysis

> # BLUPs table
> print_table(blups.met[, c(1,3,4,5)],editable = 'cell', rownames = FALSE,caption = htmltools::tags$caption(" Combined BLUPs along with standard errors for grain yield ", style="color:black; font-size:130%"))
> # Save the blup file
> write.csv(blups.met, 
+           file="~/Documents/GitHub/Analysis-pipeline/Outputs/Tables/blups.combined.csv",
+           row.names = FALSE)

Combined Data Summary and Heritabilty

> summary.met.gy<-data.frame(blups.met%>% 
+         group_by(Line.type)%>% 
+   summarize(Mean = mean(blups.gy, na.rm=TRUE),
+             Median= median(blups.gy, na.rm=TRUE),
+               SD =sd(blups.gy, na.rm=TRUE),
+               Min.=min(blups.gy, na.rm=TRUE),
+                Max.=max(blups.gy, na.rm=TRUE),
+             Heritability=h2.gy.met)
+                )
> summary.met.gy<-data.frame(lapply(summary.met.gy, function(y) if(is.numeric(y)) round(y, 2) else y))
> summary.met.gy[1,7]<-"-"
> 
> print_table(summary.met.gy, rownames = FALSE)

Note: Heritability of grain yield reduced drastically, shows the impact of drought, and was evident from raw data. Also, check the mean and higher value of genotype as compared to checks.

5 Ranking of Genotypes and Correlations

  • Here in this section we will rank the genotypes and select top 10% of genotypes based on the combined analysis and plot it as bar plot.

  • We will also compare the rankings of genotypes based on BLUPs obtained in stress (drought), non-stress and combined analysis. We will see which genotypes are common in top 10% of lines all the three. We will save it in data.frame and also plot Venn Diagram.

  • We will also check the correlations between BLUPs in drought, non-stress and combined one.

5.1 Ranking of BLUPs

  • Ranking of genotypes based the BLUPs extracted in Combined analysis
> # Ranking and selection of top performing lines
> # Subset only entries
> blups.met.Genotype<-subset(blups.met, Line.type=="entry")
> # Get mean of entries and checks
> Genotype.mean<-mean(blups.met.Genotype$blups.gy)
> check.mean<-mean((subset(blups.met, Line.type=="check"))$blups.gy)
> # Arrange the BLUPs in decreasing order
> blups.met.Genotype<-blups.met.Genotype%>%arrange(desc(blups.gy))
> # Select top 35 and merge with checks
> blups.top25<-data.frame(rbind((blups.met.Genotype[1:35, ]), (subset(blups.met, Line.type=="check"))))
> blups.top25<-droplevels.data.frame(blups.top25)
> # make factor unique to keep order of entries on plot
> blups.top25$Genotype <- factor(blups.top25$Genotype, levels=unique(blups.top25$Genotype))
> # Draw the plot
>   bar.plot<-ggplot(data=blups.top25, aes(x=Genotype, y=blups.gy, fill=Line.type)) +
+   geom_bar(stat="identity", width=0.5)+
+   theme_classic()+
+     labs(title="BLUPs of Top Ranked Genotypes along with Checks",x="Genotypes", y = "BLUP Value")+
+     #scale_y_continuous(limits = c(0, 6000), breaks = seq(0, 6000, by = 500))+
+     theme (plot.title = element_text(color="black", size=1, face="bold", hjust=0),
+            axis.title.x = element_text(color="black", size=10, face="bold"),
+            axis.title.y = element_text(color="black", size=10, face="bold")) +
+     theme(axis.text= element_text(color = "black", size = 8))+
+     geom_segment(aes(x = 1, y = Genotype.mean, xend = 35, yend =Genotype.mean), color="darkred", 
+                  linetype="dashed", size=1)+
+     geom_segment(aes(x = 36, y = check.mean, xend = 47, yend =check.mean), color="darkblue", 
+                  linetype="dashed", size=1)+
+     theme(axis.text.x = element_text(angle = 90, hjust = 1))
>   ggplotly(bar.plot)

Bar plot showing the BLUPs for top 10% of genotypes and all the checks. Dased lines shows overall mean of all genotypes and checks. Genotypes differ slightly from checks and mean of entries and checks are almost close.

5.2 Correlation Between BLUPs

  • Here checking the correlation of BLUPs obtained in drought, non-stress and combined analysis
> 
> # BLUPs in drought
>   blups.dr<-subset(blups.all, Stress=="Drought", select =c(1,4))
>   colnames(blups.dr)<-c("Genotype", "BLUPs.drought")
> # Blups in non-stress data
>   blups.ns<-subset(blups.all, Stress=="Non-stress", select =c(1,4))
>   colnames(blups.ns)<-c("Genotype", "BLUPs.non-stress")
>   
> # now combined blups
>   blups.com<-blups.met[, c(1,2,4)]
>   colnames(blups.com)<-c("Genotype", "B4R.designation", "BLUPs.combined")
> # Merge all the BLUPs
> blups.com.all<-merge((merge(blups.dr, blups.ns, by="Genotype")), (blups.com), by="Genotype")
> corr.blup <- data.frame(round(cor(blups.com.all[,-c(1,4)]), 2))
> print_table(corr.blup, rownames = TRUE, caption = htmltools::tags$caption("Correlation of BLUPs obtained in seperate analysis for drought, non-stress and in combined analysis.", style="color:black; font-size:130%"))

Note: Correlations between drought and non-stress seems extremely low, which may be due high effect of drought, but both show high correlation with combined BLUPs. This may be reason of very low heritability in the combined analysis

5.3 Venn Diagram of Top lines

  • Here we will look at the lines that are common in top 10% of lines in drought, non-stress and combined analysis.
> # Combined blups
> com.blups.top<-subset(blups.met, Line.type=="entry")
> com.blups.top<-com.blups.top%>%arrange(desc(blups.gy))
> com.blups.top<-com.blups.top[1:35,]
> colnames(com.blups.top)[1]<-"Genotype.com"
> 
> # Blups in drought
> blups.dr<-subset(blups.all, Stress=="Drought", select =c(1,4))
> blups.dr.top<-blups.dr%>%arrange(desc(blups.gy))
> blups.dr.top<-blups.dr.top[1:35,]
> colnames(blups.dr.top)[1]<-"Genotype.dr"
> # Blups in non-stress
> blups.ns<-subset(blups.all, Stress=="Non-stress", select =c(1,4))
> blups.ns.top<-blups.ns%>%arrange(desc(blups.gy))
> blups.ns.top<-blups.ns.top[1:35,]
> colnames(blups.ns.top)[1]<-"Genotype.ns"
> # Now cbinb all the required columns
> 
> data.venn<-data.frame(cbind(Combined=com.blups.top$Genotype.com, Drought=blups.dr.top$Genotype.dr, Non.stress=blups.ns.top$Genotype.ns))
> library(RColorBrewer)
> myCol <- brewer.pal(3, "Pastel2")
> P<-venn.diagram(
+   x = list(data.venn$Combined, data.venn$Drought, data.venn$Non.stress),
+   category.names = c("Combined.BLUPs" , "Drought.BLUPs " , "Non.Stress.BLUPs"),
+   filename = '~/Documents/GitHub/Analysis-pipeline/Codes/14_venn_diagramm.png',
+   output=TRUE,
+   # Output features
+   imagetype="png" ,
+   height = 1200 , 
+   width = 1200 , 
+   resolution = 500,
+   # Circles
+   lwd = 2,
+   lty = 'blank',
+   fill = myCol,
+   
+   # Numbers
+   cex = .6,
+   fontface = "bold",
+   fontfamily = "sans",
+   
+   # Set names
+   cat.cex = 0.2,
+   cat.fontface = "bold",
+   cat.default.pos = "outer",
+   cat.pos = c(-27, 27, 135),
+   cat.dist = c(0.055, 0.055, 0.085),
+   cat.fontfamily = "sans",
+   rotation = 1
+   
+ )
> P
[1] 1

Venn diagram showing list of lines that are common between separate analysis in drought and non-stress trials, and combined analysis of stress and non-stress trials. Seven top ranking genotypes were found common in drought, non-stress and combined analysis and can be used for selection across normal and drought conditions

5.4 List of lines that are common in all the three

> overlap <- calculate.overlap(
+     x = list(data.venn$Combined, data.venn$Drought, data.venn$Non.stress)
+ )
> datatable(t(overlap$a5), rownames = TRUE, caption = htmltools::tags$caption("List of entries that are common between drought, non-stress separate analysis, and in combined data analysis", style="color:black; font-size:130%"))

Table showing list of entries (genotype number is shown) that are common in all the three.

6 Final Remarks

  • Models accounting for spatial variations were more appropriate than models accounting just for experimental design factors.
  • Significant differences between drought and non-stress trials indicate effect of drought treatment.
  • Heritability of combined stress and non-stress analysis is very low showing high effect of drought.
  • Combined analysis performed would be more appropriate to rank lines, as done above.

Note: For questions specific to analysis please contact


If your experiment needs a statistician, you need a better experiment - Ernest Rutherford